Stop Scraping my Cgit!


Created: 2025-08-03 | Modified: 2025-08-07

Blocking scrapers in the nftables firewall

When you self-host and expose a git server to the internet, you'll find your access log filled with scrapers. Hence I've had the following in my git nginx config to ask the bots to kindly fuck off. This stops a lot of bots, namely the ones that respect it:

        location /robots.txt {
                return 200 "User-agent: * # match all bots
Disallow: / # keep them out";
        }
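To see what a rule-abiding crawler makes of that response, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent `ExampleBot` is made up for illustration, and the helper name `is_allowed` is my own; the robots.txt body is the one from the nginx block above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body returned by the nginx location block above.
ROBOTS_TXT = """User-agent: * # match all bots
Disallow: / # keep them out"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical user agent, used only for illustration.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://git.node5.net/firmware/qmk/"))  # False
```

This only models the polite bots, of course. Nothing forces a scraper to fetch or obey robots.txt.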

However, after reading the blog post Stop Scraping my Git Forge! - notashelf.dev, I thought let's take another look, and would you look at that: lots of entries like the following:

47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
$ whois 101.44.71.209 | grep netname
OrgName:        Huawei-Cloud-HK
$ whois 47.79.213.166 | grep Organization
OrgName:        Alibaba Cloud LLC (AL-3)
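To decide which addresses are worth a whois lookup, it can help to count requests per client first. The following is a rough sketch, assuming the log uses nginx's "combined" format as in the entries above; the function name `count_clients` is my own.

```python
import re
from collections import Counter

# Leading fields of a "combined"-style log line:
# client IP, identity, user, [timestamp], then the quoted request line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "')


def count_clients(lines):
    """Count requests per client IP, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it an open log file and printing `count_clients(log).most_common(10)` would list the ten busiest clients. Since scrapers tend to spread requests over many addresses, the per-IP counts are only a starting point for finding the network behind them.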

If you look up the IP(s) on bgp.he.net, you can find all the associated IP prefixes. Copy the text of that page to a text file and grep it with this pattern (source):

grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"

This prints every IPv4 range, together with whatever follows it on the same line.
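If you would rather have only the prefixes, validated and de-duplicated, the same idea can be sketched in Python. This is an illustration rather than what I run: the stricter pattern (requiring a `/length` suffix) and the function name `extract_prefixes` are my own choices.

```python
import ipaddress
import re

# A dotted quad followed by a prefix length, e.g. 1.178.32.0/20.
CIDR = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}/\d{1,2}(?!\d)")


def extract_prefixes(text):
    """Return the unique, valid IPv4 networks found in `text`, sorted."""
    networks = set()
    for candidate in CIDR.findall(text):
        try:
            networks.add(ipaddress.IPv4Network(candidate))
        except ValueError:
            # Out-of-range octet, bad prefix length, or host bits set.
            continue
    return sorted(networks)
```

Validating with `ipaddress` means something like `999.1.1.0/24`, which the regex alone would accept, never reaches the firewall config.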


Blocking

Nginx

You can include a file, e.g. inside the server block, with: include /etc/nginx/sites-available/blocklist.conf;

blocklist.conf:

# AS136907 HUAWEI CLOUDS
deny 1.178.32.0/20;
deny 1.178.48.0/20;
...

This, however, will still fill your access logs...
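A blocklist.conf in the format shown above does not have to be maintained by hand. Here is a small sketch that renders one from prefixes grouped under a comment line; the function name and the dictionary layout are assumptions made for this example.

```python
import ipaddress


def render_nginx_blocklist(groups):
    """Render {comment: [cidr, ...]} as nginx `deny` directives."""
    lines = []
    for comment, prefixes in groups.items():
        lines.append(f"# {comment}")
        for prefix in prefixes:
            # Raises ValueError on a malformed prefix instead of writing it out.
            network = ipaddress.ip_network(prefix)
            lines.append(f"deny {network};")
    return "\n".join(lines) + "\n"


print(render_nginx_blocklist({"AS136907 HUAWEI CLOUDS": ["1.178.32.0/20", "1.178.48.0/20"]}))
```

Write the returned string to the included file, then test the configuration and reload nginx.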

Nftables

Even better, you can block these IPs entirely with nftables.

In /etc/nftables.conf, add the following (source):

include "nftables_blocklist.conf"

table inet filter {

        set blocklist {
                type ipv4_addr; flags interval;
                auto-merge
                elements = $blocklist
        }

        chain input_world {
                ip saddr @blocklist counter drop
...

nftables_blocklist.conf:

define blocklist = {
        1.178.32.0/20, # AS136907 HUAWEI CLOUDS
        1.178.48.0/20,
...

$ sudo nft list ruleset | grep '@blocklist'
        ip saddr @blocklist counter packets 29 bytes 1732 drop
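The nftables_blocklist.conf file can be generated in the same spirit. The sketch below is an assumption-laden illustration, not my actual tooling: `render_nft_define` is a made-up name, and it merges adjacent ranges up front with `ipaddress.collapse_addresses`, which drops the per-AS comments.

```python
import ipaddress


def render_nft_define(prefixes, name="blocklist"):
    """Render IPv4 prefixes as an nftables `define` statement."""
    networks = [ipaddress.IPv4Network(prefix) for prefix in prefixes]
    if not networks:
        raise ValueError("refusing to render an empty blocklist")
    # Merge adjacent and overlapping ranges to keep the file short.
    merged = list(ipaddress.collapse_addresses(networks))
    body = ",\n".join(f"        {network}" for network in merged)
    return f"define {name} = {{\n{body}\n}}\n"


print(render_nft_define(["1.178.32.0/20", "1.178.48.0/20"]))
```

For the two example prefixes this yields a single 1.178.32.0/19 entry, because the two /20 ranges sit next to each other. The set in nftables.conf already uses auto-merge, so collapsing beforehand is optional and mainly keeps the file readable.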

Git commits = LLM training data

On a side note, I think LLM companies are scraping, or are going to scrape, git repos heavily, since a good git commit basically works as a recipe for completing an isolated task. That only pays off so long as they're able to rank the quality of the input data: the model is only as good as its input, and there's a lot of noise in a lot of the data.

